# %pip install optuna --quiet
There is a separate script, assignment_2.py, which handles execution of the optuna studies. In this notebook, we load its results and briefly discuss them.
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
import random as rand
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm
import numpy as np
import optuna
from optuna.trial import TrialState
import joblib
from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice
# reject randomness (as much as possible)
manualSeed = 2021
np.random.seed(manualSeed)
rand.seed(manualSeed)
torch.manual_seed(manualSeed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(manualSeed)
    torch.cuda.manual_seed_all(manualSeed)
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

# DataLoader passes a worker id to worker_init_fn, so accept it
def _init_fn(worker_id):
    np.random.seed(manualSeed)
### it should be beneficial to use some data augmentation here
transform_train = transforms.Compose(
    [transforms.RandomCrop(32, padding=4),
     transforms.RandomHorizontalFlip(),
     transforms.ToTensor(),
     # legends say that these are the true values for CIFAR10
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))])

transform_valid = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))])
trainset = dsets.CIFAR10('data', train=True, download=True, transform=transform_train)
testset = dsets.CIFAR10('data', train=False, download=True, transform=transform_valid)
Files already downloaded and verified
Files already downloaded and verified
BATCH_SIZE = 4096
train_loader = torch.utils.data.DataLoader(dataset=trainset,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True,
                                           worker_init_fn=_init_fn)
valid_loader = torch.utils.data.DataLoader(dataset=testset,
                                           batch_size=BATCH_SIZE,
                                           shuffle=False,
                                           worker_init_fn=_init_fn)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device is : {device}")
Device is : cuda
# load first study and summarize results
study = joblib.load("CIFAR10_study.pkl")
pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])
print("Study statistics: ")
print(" Number of finished trials: ", len(study.trials))
print(" Number of pruned trials: ", len(pruned_trials))
print(" Number of complete trials: ", len(complete_trials))
print("Best trial:")
trial = study.best_trial
print(" Value: ", trial.value)
print(" Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Study statistics:
Number of finished trials: 375
Number of pruned trials: 238
Number of complete trials: 136
Best trial:
Value: 54.87999956665039
Params:
z_wide_dim: 2000
z_narrow_dim: 1000
p_drop_in: 0.0
p_drop_hidden: 0.4
sigma_in: 0.5
sigma_hidden: 0.0
num_z_layers: 1
learning_rate: 0.0002235891877402582
weight_decay: 0.00010013574308045192
These values seem strange at first glance. The paper this architecture is taken from uses 3 z-layers and achieves a much better accuracy than roughly $0.55$. Also, in our first tests with this architecture, only a few attempts at manual fine-tuning were needed to exceed $0.6$ accuracy. Therefore, we examine the results below.
# first plot how the score improved over time
plot_optimization_history(study)
The plot above looks strange: the trials close to the best one were already found very early, and no improvement followed afterwards. Next, we check the importance that was assigned to each parameter.
plot_param_importances(study)
It seems that the number of z-layers was the most important parameter. We suspect that the sampler therefore set this parameter to 1 in almost every trial; we can quickly verify this with a histogram over the values it took across all trials.
n_z_layers_hist = []
for trial in study.trials:
    n_z_layers_hist.append(trial.params['num_z_layers'])
plt.hist(np.array(n_z_layers_hist), bins=3)
plt.show()
Indeed, the number of z-layers was 1 in the vast majority of trials. This very important parameter seems entirely underexplored. We therefore conducted another study, this time requiring at least 2 z-layers and enabling the experimental multivariate mode of the optuna sampler. We load it below and perform the same checks.
# load second study and summarize result
# please ignore the file name, we are not sure if this study is indeed 'improved'
study2 = joblib.load("CIFAR10_study_improved.pkl")
pruned_trials = study2.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study2.get_trials(deepcopy=False, states=[TrialState.COMPLETE])
print("Study statistics: ")
print(" Number of finished trials: ", len(study2.trials))
print(" Number of pruned trials: ", len(pruned_trials))
print(" Number of complete trials: ", len(complete_trials))
print("Best trial:")
trial2 = study2.best_trial
print(" Value: ", trial2.value)
print(" Params: ")
for key, value in trial2.params.items():
    print("    {}: {}".format(key, value))
Study statistics:
Number of finished trials: 425
Number of pruned trials: 343
Number of complete trials: 81
Best trial:
Value: 60.33999990844727
Params:
z_wide_dim: 2000
z_narrow_dim: 400
p_drop_in: 0.0
p_drop_hidden: 0.0
sigma_in: 0.25
sigma_hidden: 0.0
num_z_layers: 2
learning_rate: 0.0001468265237818208
weight_decay: 0.0003678002649684844
These values seem more realistic, and the best trial is close to the performance that we got by manual tuning. Let's check this study, too.
# first plot how the score improved over time
plot_optimization_history(study2)
Again, good values were found very quickly; optuna seems to be good at finding (local) maxima fast. However, judging from the paper we are essentially trying to reproduce, an even higher accuracy should be possible with 3 z-layers. Therefore, let's look at the number of z-layers again.
plot_param_importances(study2)
n_z_layers_hist = []
for trial in study2.trials:
    n_z_layers_hist.append(trial.params['num_z_layers'])
plt.hist(np.array(n_z_layers_hist), bins=2)
plt.show()
These statistics look healthier now. However, the number of automatically pruned trials, i.e. trials that did not reach a good accuracy quickly enough, is very high for this study. We suspect that the pruning discarded many potentially promising 3-layer models, since they take longer to train. Next time, we will be more careful with the pruning.
# load the values as found by the study and retrain the network
INPUT_DIM = 3072 # Immutable
Z_WIDE_DIM = trial2.params['z_wide_dim']
Z_NARROW_DIM = trial2.params['z_narrow_dim']
OUTPUT_DIM = 10 # Immutable
LEARNING_RATE = trial2.params['learning_rate']
WEIGHT_DECAY = trial2.params['weight_decay']
P_DROP_IN = trial2.params['p_drop_in']
P_DROP_HIDDEN = trial2.params['p_drop_hidden']
SIGMA_IN = trial2.params['sigma_in']
SIGMA_HIDDEN = trial2.params['sigma_hidden']
NUM_Z_LAYERS = trial2.params['num_z_layers']
# network architecture stolen from: https://openreview.net/pdf/1WvovwjA7UMnPB1oinBL.pdf
class GaussianNoise(nn.Module):
    """Additive Gaussian noise layer; only active in training mode."""

    def __init__(self, sigma=0.1, is_relative_detach=True):
        super().__init__()
        self.sigma = sigma
        self.is_relative_detach = is_relative_detach
        # dummy buffer so the sampled noise lives on the model's device
        self.register_buffer('noise', torch.tensor(0))

    def forward(self, x):
        if self.training and self.sigma != 0:
            # the noise scale is relative to the input magnitude
            scale = self.sigma * x.detach() if self.is_relative_detach else self.sigma * x
            sampled_noise = self.noise.expand(*x.size()).float().normal_() * scale
            x = x + sampled_noise
        return x
class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, z_wide_dim=Z_WIDE_DIM,
                 z_narrow_dim=Z_NARROW_DIM, output_dim=OUTPUT_DIM,
                 p_drop_in=P_DROP_IN, p_drop_hidden=P_DROP_HIDDEN,
                 sigma_in=SIGMA_IN, sigma_hidden=SIGMA_HIDDEN,
                 num_z_layers=NUM_Z_LAYERS):
        super().__init__()
        self.input_dim = input_dim
        # input block: noise + dropout, then project up to the wide dimension
        layers = [GaussianNoise(sigma=sigma_in),
                  nn.Dropout(p=p_drop_in),
                  nn.Linear(input_dim, z_wide_dim),
                  GaussianNoise(sigma=sigma_hidden),
                  nn.Dropout(p=p_drop_hidden)]
        # each z-layer is a narrow bottleneck followed by a wide expansion
        for i in range(num_z_layers):
            layers += [nn.Linear(z_wide_dim, z_narrow_dim, bias=False),
                       nn.ReLU(),
                       GaussianNoise(sigma=sigma_hidden),
                       nn.Dropout(p=p_drop_hidden),
                       nn.Linear(z_narrow_dim, z_wide_dim),
                       GaussianNoise(sigma=sigma_hidden),
                       nn.Dropout(p=p_drop_hidden)]
        layers += [nn.Linear(z_wide_dim, output_dim)]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        x_flat = x.view(-1, self.input_dim)
        out = self.model(x_flat)
        return out
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
model = LogisticRegressionModel()
model = model.to(device)
# Actually enforce zero-mean weight initialization; the default init's mean is often quite far from zero.
with torch.no_grad():
    for param in model.parameters():
        param.data = param.data - param.data.mean()
criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
print('Number of trainable parameters:', count_parameters(model))
Number of trainable parameters: 9370010
N_EPOCHS = 180
UPDATE_EVERY = 1
train_loss_hist = []
train_acc_hist = []
valid_loss_hist = []
valid_acc_hist = []
progress_bar = tqdm(range(N_EPOCHS), total=N_EPOCHS, position=0, leave=True)
for epoch in progress_bar:
    train_loss_list, train_acc_list, batch_sizes_train = [], [], []
    valid_loss_list, valid_acc_list, batch_sizes_valid = [], [], []

    model.train()
    for imgs, labels in train_loader:
        optimizer.zero_grad()
        labels = labels.to(device)
        pred_labels = model(imgs.to(device))
        loss = criterion(pred_labels, labels)
        loss.backward()
        optimizer.step()
        train_loss_list.append(loss.item())
        batch_sizes_train.append(labels.shape[0])
        pred_label_inds = pred_labels.argmax(dim=-1)
        acc = (pred_label_inds == labels).float().mean() * 100
        train_acc_list.append(acc.item())
    # batch-size-weighted averages over the epoch
    batch_sizes_train = np.array(batch_sizes_train)
    train_loss_list = np.array(train_loss_list)
    weighted_train_loss_list = train_loss_list * batch_sizes_train
    train_loss_hist.append(weighted_train_loss_list.sum() / batch_sizes_train.sum())
    train_acc_list = np.array(train_acc_list)
    weighted_train_acc_list = train_acc_list * batch_sizes_train
    train_acc_hist.append(weighted_train_acc_list.sum() / batch_sizes_train.sum())

    model.eval()
    for imgs, labels in valid_loader:
        with torch.no_grad():
            labels = labels.to(device)
            pred_labels = model(imgs.to(device))
            loss = criterion(pred_labels, labels)
            valid_loss_list.append(loss.item())
            batch_sizes_valid.append(imgs.shape[0])
            pred_label_inds = pred_labels.argmax(dim=-1)
            acc = (pred_label_inds == labels).float().mean() * 100
            valid_acc_list.append(acc.item())
    batch_sizes_valid = np.array(batch_sizes_valid)
    valid_loss_list = np.array(valid_loss_list)
    weighted_valid_loss_list = valid_loss_list * batch_sizes_valid
    valid_loss_hist.append(weighted_valid_loss_list.sum() / batch_sizes_valid.sum())
    valid_acc_list = np.array(valid_acc_list)
    weighted_valid_acc_list = valid_acc_list * batch_sizes_valid
    valid_acc_hist.append(weighted_valid_acc_list.sum() / batch_sizes_valid.sum())

    if epoch % UPDATE_EVERY == 0 or epoch == N_EPOCHS - 1:
        progress_bar.set_description(f"Epoch {epoch+1} - Train: loss {train_loss_hist[-1]:.4f} | acc {train_acc_hist[-1]:.2f} - Valid: loss {valid_loss_hist[-1]:.4f} | acc {valid_acc_hist[-1]:.2f}. ")
Epoch 180 - Train: loss 0.9004 | acc 67.90 - Valid: loss 1.1491 | acc 60.34. : 100%|██████████| 180/180 [30:36<00:00, 10.20s/it]
Plot the learning curves for loss and accuracy.
plt.plot(train_loss_hist, label='train loss')
plt.plot(valid_loss_hist, label='valid loss')
plt.legend()
plt.show()
plt.plot(train_acc_hist, label='train acc')
plt.plot(valid_acc_hist, label='valid acc')
plt.legend()
plt.show()
Also show the confusion matrix.
import pandas as pd
valid_loader = torch.utils.data.DataLoader(dataset=testset,
                                           batch_size=10000,
                                           shuffle=False,
                                           worker_init_fn=_init_fn)
images, labels = next(iter(valid_loader))
model.eval()
with torch.no_grad():  # no gradients needed for evaluation
    labels = labels.to(device)
    pred_labels = model(images.to(device))
pred_label_inds = pred_labels.argmax(dim=-1)
accuracy = (pred_label_inds == labels).float().mean() * 100
print(f'Classification accuracy: {round(accuracy.item(), 2)}%')
classes = ('plane', 'car', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
labels_strings = [classes[labels.cpu()[i]] for i in range(10000)]
pred_strings = [classes[pred_label_inds.cpu()[i]] for i in range(10000)]
labels_data = pd.Categorical(labels_strings)
pred_data = pd.Categorical(pred_strings)
print('Confusion Matrix:')
df_confusion = pd.crosstab(labels_data, pred_data, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion.head(10)
Classification accuracy: 60.34%
Confusion Matrix:
| Predicted | bird | car | cat | deer | dog | frog | horse | plane | ship | truck | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual | |||||||||||
| bird | 452 | 19 | 77 | 83 | 66 | 131 | 41 | 82 | 19 | 30 | 1000 |
| car | 11 | 732 | 14 | 6 | 7 | 8 | 3 | 26 | 69 | 124 | 1000 |
| cat | 63 | 25 | 415 | 50 | 196 | 107 | 51 | 25 | 24 | 44 | 1000 |
| deer | 112 | 19 | 55 | 447 | 43 | 152 | 90 | 43 | 29 | 10 | 1000 |
| dog | 38 | 8 | 194 | 35 | 507 | 74 | 65 | 32 | 21 | 26 | 1000 |
| frog | 47 | 21 | 70 | 38 | 44 | 713 | 14 | 22 | 17 | 14 | 1000 |
| horse | 28 | 12 | 39 | 59 | 62 | 28 | 706 | 27 | 9 | 30 | 1000 |
| plane | 38 | 37 | 25 | 29 | 10 | 15 | 18 | 633 | 134 | 61 | 1000 |
| ship | 4 | 53 | 13 | 18 | 12 | 6 | 11 | 62 | 781 | 40 | 1000 |
| truck | 10 | 163 | 25 | 8 | 13 | 15 | 27 | 34 | 57 | 648 | 1000 |
The accuracy here indeed reaches 60%, but the learning curves clearly show overfitting. Next time, we think it would be beneficial to forgo optuna's automatic pruning and instead implement a stopping criterion in the model training itself, for example 'early' stopping based on the test set accuracy. Trial results for different numbers of z-layers would then be more comparable, and the hyperparameter optimization should have an easier time finding slightly better configurations. That said, 60% is probably fine here, and the paper the architecture comes from likely achieved better results because we did not replicate their training process, which is more involved than what we used here.
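A minimal sketch of the kind of stopping criterion we have in mind: stop once the monitored accuracy has not improved for a fixed number of epochs. The patience value and the accuracy history below are arbitrary illustrations.

```python
class EarlyStopping:
    """Signal a stop when the monitored accuracy stops improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, valid_acc):
        """Record one epoch's accuracy; return True if training should stop."""
        if valid_acc > self.best + self.min_delta:
            self.best = valid_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [50.0, 55.0, 56.0, 55.9, 55.8, 55.7]  # made-up accuracies
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at)  # 5: three epochs without improvement after epoch 2
```

In the training loop above, `stopper.step(valid_acc_hist[-1])` would be checked at the end of each epoch, breaking out of the loop when it returns True.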
Comment from Jan on optuna: I really like the way the parameters to be optimized are defined in the code. However, I was unable to find a way to have several processes work on the same study in a thread-safe manner without resorting to an additional SQLite database, which seems very inconvenient to me. WandB offers the same visualizations, allows conveniently checking a study's progress online, has superb git integration, and has a trivial setup process for configuring multiple agents, across several GPUs and even machines, to work on the same study. Also, having used it countless times, I have never encountered a problem where 'computer science student descent' was more effective than the parameters that WandB would find after sweeping for a while. I remain somewhat unconvinced that optuna is the better option.